Perfect is the enemy of test oracle
Automation of test oracles is one of the most challenging facets of software
testing, but remains less addressed than automated test
input generation. Test oracles rely on a ground-truth that can distinguish
between the correct and buggy behavior to determine whether a test fails
(detects a bug) or passes. What makes the oracle problem challenging and
undecidable is the assumption that the ground-truth should know the exact
expected, correct, or buggy behavior. However, we argue that one can still
build an accurate oracle without knowing the exact correct or buggy behavior,
knowing only how these two might differ. This paper presents SEER, a learning-based
approach that, in the absence of test assertions or other types of oracle, can
determine whether a unit test passes or fails on a given method under test
(MUT). To build the ground-truth, SEER jointly embeds unit tests and the
implementation of MUTs into a unified vector space, in such a way that the
neural representations of tests are similar to those of the MUTs they pass on,
but dissimilar to those of the MUTs they fail on. The classifier built on top of this
vector representation serves as the oracle, generating "fail" labels when test
inputs detect a bug in the MUT and "pass" labels otherwise. Our extensive
experiments on applying SEER to more than 5K unit tests from a diverse set of
open-source Java projects show that the produced oracle is (1) effective in
predicting the fail or pass labels, achieving an overall accuracy, precision,
recall, and F1 measure of 93%, 86%, 94%, and 90%, respectively, (2) generalizable, predicting
the labels for the unit tests of projects that were not in the training or
validation set with negligible performance drop, and (3) efficient, detecting
the existence of bugs in only 6.5 milliseconds on average.
Comment: Published in ESEC/FSE 202
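The idea of labeling a test by how close its embedding is to the embedding of the MUT can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the vectors are hand-written rather than learned, and a fixed cosine-similarity threshold stands in for SEER's trained classifier, which the abstract does not specify. The `f1_score` helper only checks that the reported precision and recall are consistent with the reported F1 measure.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_label(test_vec, mut_vec, threshold=0.5):
    """Hypothetical stand-in for a learned oracle.

    Tests embedded close to the MUT are labeled "pass", tests embedded
    far from it are labeled "fail". The threshold is an assumption.
    """
    similarity = cosine_similarity(test_vec, mut_vec)
    return "pass" if similarity >= threshold else "fail"


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Made-up embeddings: the first pair is aligned, the second is orthogonal.
print(predict_label([1.0, 0.0], [1.0, 0.0]))
print(predict_label([1.0, 0.0], [0.0, 1.0]))

# Precision 86% and recall 94% give an F1 of about 0.898, i.e. roughly 90%.
print(round(f1_score(0.86, 0.94), 3))
```

A real system would replace the fixed threshold with a classifier trained on labeled test/MUT pairs, as the abstract describes.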
White-box Compiler Fuzzing Empowered by Large Language Models
Compiler correctness is crucial, as miscompilation that falsifies program
behavior can lead to serious consequences. In the literature, fuzzing has been
extensively studied to uncover compiler defects. However, compiler fuzzing
remains challenging: existing techniques focus on black- and grey-box fuzzing, which
generates tests without sufficient understanding of internal compiler
behaviors. As such, they often fail to construct programs to exercise
conditions of intricate optimizations. Meanwhile, traditional white-box
techniques are computationally inapplicable to the giant codebase of compilers.
Recent advances demonstrate that Large Language Models (LLMs) excel in code
generation/understanding tasks and have achieved state-of-the-art performance
in black-box fuzzing. Nonetheless, prompting LLMs with compiler source-code
information remains a missing piece of research in compiler testing.
To this end, we propose WhiteFox, the first white-box compiler fuzzer using
LLMs with source-code information to test compiler optimization. WhiteFox
adopts a dual-model framework: (i) an analysis LLM examines the low-level
optimization source code and produces requirements on the high-level test
programs that can trigger the optimization; (ii) a generation LLM produces test
programs based on the summarized requirements. Additionally,
optimization-triggering tests are used as feedback to further enhance the test
generation on the fly. Our evaluation on four popular compilers shows that
WhiteFox can generate high-quality tests to exercise deep optimizations
requiring intricate conditions, exercising up to 80 more optimizations than
state-of-the-art fuzzers. To date, WhiteFox has found in total 96 bugs, with 80
confirmed as previously unknown and 51 already fixed. Beyond compiler testing,
WhiteFox can also be adapted for white-box fuzzing of other complex, real-world
software systems in general.
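The dual-model loop with feedback described above can be sketched roughly as follows. This is not WhiteFox's implementation: the three callables (`analyze`, `generate`, `triggers_optimization`) are hypothetical stand-ins for the analysis LLM, the generation LLM, and a check that the compiled program actually exercised the optimization. Reusing the most recent triggering tests as examples is also an assumption, since the abstract does not say how feedback tests are selected.

```python
from typing import Callable, List


def whitebox_fuzz_sketch(
    optimization_source: str,
    analyze: Callable[[str], str],
    generate: Callable[[str, List[str]], str],
    triggers_optimization: Callable[[str], bool],
    rounds: int = 3,
    max_examples: int = 2,
) -> List[str]:
    """Return the generated programs that triggered the optimization."""
    # Step (i): summarize the optimization source into requirements.
    requirements = analyze(optimization_source)

    triggering_tests: List[str] = []
    for _ in range(rounds):
        # Feedback: pass the most recent triggering tests as examples.
        examples = triggering_tests[-max_examples:]
        # Step (ii): generate a test program from the requirements.
        program = generate(requirements, examples)
        if triggers_optimization(program):
            triggering_tests.append(program)
    return triggering_tests


# Deterministic stubs in place of real models and a real compiler.
found = whitebox_fuzz_sketch(
    "fold_constant_add(...)",
    analyze=lambda src: "requirements for " + src,
    generate=lambda req, examples: "prog%d" % len(examples),
    triggers_optimization=lambda program: True,
)
print(found)
```

Passing the models in as plain callables keeps the sketch runnable without any LLM client and makes the feedback path easy to inspect.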
Advancing Energy Testing of Mobile Applications
The rising popularity of mobile apps deployed on battery-constrained devices has motivated the need for effective and efficient energy-aware testing techniques. However, there is currently a lack of test generation tools for exercising the energy properties of apps. Moreover, automated test generation is not useful without tools that help developers measure the quality of the tests. Additionally, the collection of tests generated for energy testing could be quite large, as it may involve a test suite that covers all the energy-greedy parts of the code under different use cases. Therefore, there is a need for techniques to manage the size of the test suite while maintaining its effectiveness in revealing energy defects. This research proposes a four-pronged approach to advance energy testing for mobile applications, including techniques for energy-aware test input generation, energy-aware test oracle construction, energy-aware test-suite adequacy assessment, and energy-aware test-suite minimization.
Transforming test suites into croissants
Software developers often rely on regression testing to ensure that recent changes made to the source code do not introduce bugs. Flaky tests, which non-deterministically pass or fail regardless of any change to the code, can negatively impact the effectiveness of regression testing. While the state of the art is advancing the techniques for test-flakiness detection and mitigation, the community is missing a systematic approach for generating high-quality benchmarks of flaky tests to compare the effectiveness of such techniques. Inspired by the power of mutation testing in evaluating the fault-detection ability of tests, this paper proposes Croissant, a framework for injecting flakiness into test suites to assess the effectiveness of test-flakiness detection tools in finding these tests. Croissant implements 18 flakiness-inducing mutation operators. We designed these operators to allow controlling the non-determinism involved in flakiness, i.e., making many mutants deterministically pass or fail to observe flaky behavior. Our extensive empirical evaluation of Croissant on the test suites of 15 real-world projects confirms the ability of the designed mutation operators to generate high-quality mutants, and their effectiveness in challenging test-flakiness detection tools in revealing flaky tests.
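To make the idea of a flakiness-inducing mutation with controllable non-determinism concrete, here is a minimal sketch. It is not one of Croissant's 18 operators, which the abstract does not enumerate. The `mutate_with_flakiness` wrapper and its `failure_probability` knob are hypothetical: a probability of 0.0 makes the mutant pass deterministically, 1.0 makes it fail deterministically, and values in between yield flaky behavior.

```python
import random


def mutate_with_flakiness(test_fn, failure_probability, rng=None):
    """Wrap a test so that it fails with a controllable probability.

    Hypothetical operator for illustration only.
    """
    if rng is None:
        rng = random.Random()

    def mutant():
        # rng.random() lies in [0.0, 1.0), so 0.0 never fails here
        # and 1.0 always fails here.
        if rng.random() < failure_probability:
            raise AssertionError("injected flaky failure")
        test_fn()

    return mutant


def run_outcomes(test_fn, runs):
    """Run a test repeatedly and record "pass" or "fail" for each run."""
    outcomes = []
    for _ in range(runs):
        try:
            test_fn()
            outcomes.append("pass")
        except AssertionError:
            outcomes.append("fail")
    return outcomes


def original_test():
    assert 1 + 1 == 2


always_pass = mutate_with_flakiness(original_test, 0.0)
always_fail = mutate_with_flakiness(original_test, 1.0)
flaky = mutate_with_flakiness(original_test, 0.5, rng=random.Random(0))

print(run_outcomes(always_pass, 3))
print(run_outcomes(always_fail, 3))
print(run_outcomes(flaky, 10))
```

Pinning the knob to either extreme gives a reproducible reference outcome against which a flakiness detection tool's verdict on the intermediate, flaky setting could be compared.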